72 research outputs found

A framework of dynamic data structures for string processing

Author: Prezza N.
Publication date: 01/01/2017

In this paper we present DYNAMIC, an open-source C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gap-encoded bitvectors, and entropy/run-length compressed strings and FM indexes. We prove close-to-optimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of five recently published compression algorithms implemented using DYNAMIC with that of state-of-the-art tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more space-efficient (albeit slower) than classical ones performing the same tasks.
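To make the term "searchable partial sums" concrete, here is a minimal Python sketch built on a Fenwick tree. It is not DYNAMIC's C++ API: the class and method names are hypothetical, the sequence has a fixed length (no insertions), and the values are assumed to be non-negative.

```python
class FenwickPartialSums:
    """Fixed-length sequence of non-negative integers with updates,
    prefix sums, and search, each in O(log n) time."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)  # 1-based Fenwick array
        for i, v in enumerate(values):
            self.add(i, v)

    def add(self, i, delta):
        """Add delta to the value at 0-based index i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of the values at indices 0..i, inclusive."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, x):
        """Smallest index i with prefix_sum(i) >= x, or n if there is none."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            # Skip a whole block while its sum stays below the target.
            if nxt <= self.n and self.tree[nxt] < x:
                pos = nxt
                x -= self.tree[nxt]
            step >>= 1
        return pos
```

For the values 3, 0, 2, 5 the prefix sums are 3, 3, 5, 10, so `search(4)` lands on index 2, the first position whose prefix sum reaches 4.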

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Optimal rank and select queries on dictionary-compressed text

Author: Prezza N.
Publication date: 21/12/2018

We study the problem of supporting queries on a string S of length n within a space bounded by the size γ of a string attractor for S. In the paper introducing string attractors it was shown that random access on S can be supported in optimal O(log(n/γ)/log log n) time within O(γ polylog n) space. In this paper, we extend this result to rank and select queries and provide lower bounds matching our upper bounds on alphabets of polylogarithmic size. Our solutions are given in the form of a space-time trade-off that is more general than the one previously known for grammars and that improves existing bounds on LZ77-compressed text by a log log n time-factor in select queries. We also provide matching lower and upper bounds for partial sum and predecessor queries within attractor-bounded space, and extend our lower bounds to encompass navigation of dictionary-compressed tree representations.
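As a reminder of what rank and select compute, the following Python sketch answers both queries on a plain, uncompressed string by scanning it. It only illustrates the query semantics; the function names and the 0-based indexing are assumptions, and it has nothing in common with the attractor-bounded structures of the paper.

```python
def rank(s: str, c: str, i: int) -> int:
    """Number of occurrences of character c in the prefix s[0:i]."""
    return s[:i].count(c)


def select(s: str, c: str, j: int) -> int:
    """0-based position of the j-th occurrence (j >= 1) of c in s, or -1."""
    pos = -1
    for _ in range(j):
        pos = s.find(c, pos + 1)
        if pos == -1:
            return -1
    return pos
```

The two queries are inverse to each other in the sense that `rank(s, c, select(s, c, j) + 1) == j` whenever the j-th occurrence exists.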

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Faster Online Computation of the Succinct Longest Previous Factor Array

Author: Prezza N.
Rosone G.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

We consider the problem of computing online the Longest Previous Factor array LPF[1, n] of a text T of length n. For each position i, LPF[i] stores the length of the longest factor of T with at least two occurrences, one ending at i and the other at a previous position. We present an improvement over the previous solution by Okanohara and Sadakane (ESA 2008): our solution uses less space (compressed instead of succinct) and runs in time, thus being faster by a logarithmic factor. As a by-product, we also obtain the first online algorithm computing the Longest Common Suffix (LCS) array (that is, the LCP array of the reversed text) in time and compressed space. We also observe that the LPF array can be represented succinctly in 2n bits. Our online algorithm directly computes the succinct LPF and LCS arrays.
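The definition of LPF used in this abstract (occurrences identified by their ending positions) can be checked on small inputs with a brute-force Python sketch. It runs in cubic time in the worst case and is unrelated to the online compressed-space algorithm; the function name and the 0-based indexing are assumptions.

```python
def lpf_by_end_position(t: str) -> list[int]:
    """lpf[i] = length of the longest factor of t ending at position i
    that also has an occurrence ending at some earlier position j < i."""
    n = len(t)
    lpf = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):  # candidate earlier end position
            length = 0
            # Extend the match leftwards from both end positions.
            while length <= j and t[j - length] == t[i - length]:
                length += 1
            best = max(best, length)
        lpf[i] = best
    return lpf
```

The two occurrences are allowed to overlap, so for `"aaaa"` the result is `[0, 1, 2, 3]`.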

Archivio della Ricerca - Università di Pisa

Space-efficient computation of the LCP array from the Burrows-Wheeler transform

Author: Prezza N.
Rosone G.
Publication venue: place:Leibniz
Publication date: 01/01/2019

We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1, σ] can be computed from the Burrows-Wheeler transformed collection in O(n log σ) time using o(n log σ) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that also computes the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM.

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Special Issue on Algorithms and Data-Structures for Compressed Computation

Author: Policriti A.
Prezza N.
Publication date: 01/01/2022

As the production of massive data has outpaced Moore’s law in many scientific areas, the very notion of algorithms is transforming [...

Archivio istituzionale della ricerca - Università degli Studi di Udine

Directory of Open Access Journals

Adaptive learning of compressible strings

Author: Fici G
Prezza N
Venturini R
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021

Suppose an oracle knows a string S that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is s a substring of S?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm needs to ask the oracle Sigma n/4 - O(n) queries in order to be able to reconstruct the hidden string, where Sigma is the size of the alphabet of S and n its length, and gave an algorithm that spends (Sigma - 1)n + O(Sigma sqrt(n)) queries to reconstruct S. The main contribution of our paper is to improve the above upper bound in the context where the string is compressible. We first present a universal algorithm that, given a (computable) compressor that compresses the string to Tau bits, performs q = O(Tau) substring queries; this algorithm, however, runs in exponential time. For this reason, the second part of the paper focuses on more time-efficient algorithms whose number of queries is bounded by specific compressibility measures. We first show that any string of length n over an integer alphabet of size Sigma with rle runs can be reconstructed with q = O(rle (Sigma + log(n/rle))) substring queries in linear time and space. We then present an algorithm that spends q ∈ O(Sigma g log n) substring queries and runs in O(n(log n + log Sigma) + q) time using linear space, where g is the size of a smallest straight-line program generating the string. © 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
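The query model can be illustrated with a naive baseline in Python. This is not one of the algorithms from the paper or from Skiena and Sundaram: it simply grows a known substring to the right and then to the left, one character at a time, and spends at most |alphabet| * (n + 2) queries. The function name and the callable oracle interface are assumptions made for the sketch.

```python
from typing import Callable


def reconstruct(is_substring: Callable[[str], bool], alphabet: str) -> tuple[str, int]:
    """Recover the hidden string with substring queries only.
    Returns the string and the number of queries spent.
    The alphabet must contain every character of the hidden string."""
    queries = 0

    def ask(candidate: str) -> bool:
        nonlocal queries
        queries += 1
        return is_substring(candidate)

    w = ""
    # Phase 1: extend to the right until w can only occur as a suffix.
    extended = True
    while extended:
        extended = False
        for c in alphabet:
            if ask(w + c):
                w += c
                extended = True
                break
    # Phase 2: extend the suffix to the left until it is the whole string.
    extended = True
    while extended:
        extended = False
        for c in alphabet:
            if ask(c + w):
                w = c + w
                extended = True
                break
    return w, queries
```

Phase 1 stops only when no character can follow w, which forces every occurrence of w to sit at the end of the hidden string; phase 2 then keeps a suffix at every step, so it can only stop at the full string.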

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio istituzionale della ricerca - Università di Palermo

String attractors : Verification and optimization

Author: Kempa D.
Policriti A.
Prezza N.
Rotenberg E.
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2018

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set γ ⊆ [1..n] is a k-attractor for a string S ∈ Σ^n if and only if every distinct substring of S of length at most k has an occurrence crossing at least one of the positions in γ. Finding the smallest k-attractor is NP-hard for k ≥ 3, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the k-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a k-attractor in near-optimal time and how to quickly compute exact solutions. For example, we prove that a minimum 3-attractor can be found in O(n) time when |Σ| ∈ O((log n)^(1/(3+ϵ))) for some constant ϵ > 0, despite the problem being NP-hard for large Σ. © Dominik Kempa, Alberto Policriti, Nicola Prezza, and Eva Rotenberg. Peer reviewed
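The definition of a k-attractor translates directly into a brute-force check, sketched below in Python. This is only the definition made executable, not the near-optimal verification algorithm of the paper; the function names are hypothetical, positions are 0-based rather than 1-based, and "minimal" is taken here to mean that no single position can be removed.

```python
def is_k_attractor(s: str, gamma: set[int], k: int) -> bool:
    """True if every distinct substring of s of length <= k has at least
    one occurrence that covers a position in gamma (0-based positions)."""
    n = len(s)
    for length in range(1, min(k, n) + 1):
        covered: dict[str, bool] = {}
        for i in range(n - length + 1):
            sub = s[i:i + length]
            hit = any(i <= p < i + length for p in gamma)
            covered[sub] = covered.get(sub, False) or hit
        if not all(covered.values()):
            return False
    return True


def is_minimal_k_attractor(s: str, gamma: set[int], k: int) -> bool:
    """True if gamma is a k-attractor and dropping any one position breaks it."""
    return is_k_attractor(s, gamma, k) and all(
        not is_k_attractor(s, gamma - {p}, k) for p in gamma
    )
```

For `"abab"`, the set `{1, 2}` is a minimal attractor for every k, while `{1}` fails already at k = 1 because no occurrence of `"a"` covers position 1.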

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Helsingin yliopiston digitaalinen arkisto

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Gsufsort: Constructing suffix arrays, LCP arrays and BWTs for string collections

Author: Gog S.
Louza F. A.
Prezza N.
Rosone G.
Telles G. P.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

Background: The construction of a suffix array for a collection of strings is a fundamental task in Bioinformatics and in many other applications that process strings. Related data structures, such as the Longest Common Prefix array, the Burrows-Wheeler transform, and the document array, are often needed to accompany the suffix array to efficiently solve a wide variety of problems. While several algorithms have been proposed to construct the suffix array for a single string, less emphasis has been put on algorithms to construct suffix arrays for string collections. Result: In this paper we introduce gsufsort, open-source software for constructing the suffix array and related data indexing structures for a string collection with N symbols in O(N) time. Our tool is written in ANSI/C and is based on the algorithm gSACA-K (Louza et al. in Theor Comput Sci 678:22-39, 2017), the fastest algorithm to construct suffix arrays for string collections. The tool supports large fasta, fastq and text files with multiple strings as input. Experiments have shown very good performance on different types of strings. Conclusions: gsufsort is a fast, portable, and lightweight tool for constructing the suffix array and additional data structures for string collections.
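To show what these structures look like for a collection, here is a naive Python sketch that sorts all suffixes explicitly. It is neither gsufsort nor gSACA-K and is far from linear time. The function names, the `(document, position)` output format, and the ordering convention (each string ends with an implicit terminator smaller than every character, with ties broken by document index) are assumptions for illustration and may differ from the tool's conventions.

```python
def generalized_suffix_array(strings: list[str]) -> list[tuple[int, int]]:
    """All suffixes of all strings, as (document, position) pairs in sorted
    order. Position len(s) denotes the terminator-only suffix."""
    suffixes = [(d, p) for d, s in enumerate(strings) for p in range(len(s) + 1)]
    suffixes.sort(key=lambda dp: (strings[dp[0]][dp[1]:], dp[0]))
    return suffixes


def lcp_array(strings: list[str], gsa: list[tuple[int, int]]) -> list[int]:
    """lcp[r] = longest common prefix of the suffixes ranked r-1 and r
    (terminators never match); lcp[0] = 0."""
    def lcp(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    out = [0] if gsa else []
    for (d1, p1), (d2, p2) in zip(gsa, gsa[1:]):
        out.append(lcp(strings[d1][p1:], strings[d2][p2:]))
    return out


def bwt(strings: list[str], gsa: list[tuple[int, int]]) -> str:
    """Character preceding each sorted suffix; '$' marks a string start."""
    return "".join(strings[d][p - 1] if p > 0 else "$" for d, p in gsa)
```

The first component of each pair in the suffix array is the document array, so `[d for d, _ in gsa]` tells which input string each sorted suffix comes from.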

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Fast randomized approximate string matching with succinct hash data structures

Author: A Policriti
Alberto Policriti
B Ewing
B Langmead
F Vezzi
H Li
HL Chan
N Prezza
Nicola Prezza
P Ferragina
R Cole
R Li
Y Takenaka
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Indexing k-mers in linear space for quality value compression.

Author: Břinda K
Matteo Comin
Mohamadi H
Ochoa I
Prezza N
Schimd M
Shibuya Y
Yoshihiro Shibuya
Publication date: 01/01/2019

Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring a very large amount of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, and thus they are difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff

Crossref

Open Access Repository

Archivio istituzionale della ricerca - Università di Padova